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Abstract 

We construct new algorithms from scratch, which use the fourth order cumulant of stochastic 
variables for the cost function. The multiplicative updating rule here constructed is natural from 
the homogeneous nature of the Lie group and has numerous merits for the rigorous treatment of 
the dynamics. As one consequence, the second order convergence is shown. For the cost function, 
functions invariant under the componentwise scaling are choosen. By identifying points which can 
be transformed to each other by the scaling, we assume that the dynamics is in a coset space. In our 
method, a point can move toward any direction in this coset. Thus, no prewhitening is required. 



1 Introduction 



Suppose that A-dimensional stochastic variables {Aj|l < i < N} are observed. The independent 
component analysis (ICA) pursues a map X \— > Y, where each component of Y becomes mutually 
independent. In this letter we restrict ourselves to the linear independent component analysis. 
There we want to find a linear transformation C : X = (Ai, • • • , Xn)' ^ Y = (Y\, • • • , Yjv)' = CX. 
which minimizes some cost function that measures the independence. Hereafter we denote by the 
upper subscript / the transposition and by f the complex conjugate. 
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There can be many candidates for the cost function. For example the Kullback-Leibler infor- 
mation is a good measure for the independence. In this case the problem is translated to the 
minimization of — 5Zi=i / dyiPi(yi)ln -Pj(yj), where Pj is the probability density function of the 
i-th component. It is obvious that we must evaluate Pj's to find the optimal solution. A robust 
estimation of the probability density functions is not an easy task and if it is possible it may be 
computationally expensive. 

An alternative idea is to make use of the cumulant of the fourth order, or the 
kurtosis[ A.Hyvarinen,1997], which we will adopt in this letter. The fourth order cumulant van- 
ishes for the normal distribution. So, this cost function is robust under the gaussian random noises. 
We will construct algorithms where a matrix, which specifies the linear transformation, is updated 
by the left-multiplication of a matrix D = e A . This expression implies that D belongs to GL(N, R) 
(more accurately, the component of GL(N,R) connected to the unit element), which ensures the 
conservation of the rank. The specification of D by the coordinate A has many advantages since it 
has a compatibility with the homogeneous nature of the Lie group. 

There are variations for the form of the cost function. We will show our definitions in the 
following two sections, which are choosen to possess invariance under componentwise scaling. This 
invariance is crucial for a rigorous treatment of the convergence properties. Moreover, this invariance 
allows us to identify points in GL(N, R) which is transformed to each other by the scaling. Then 
we can legitimately restrict the dynamics to a coset space which is introduced by this identification. 

Under these settings, we determine A by using the Newton method for the second order ex- 
pansion of the cost function with respect to {Ajj}. It is assumed that the diagonal elements of A 
are zeros, which does not impose any restrictions. That is, a point can move toward any direction 
in this coset by a left-multiplication of e A . Thus it is not necesarry for our method to prewhiten 
the data. It is also not required that the optimal solution is the maximum or the minimum of the 
cost function. Indeed, the sole requirement is that the optimal point is a saddle point of the cost 
function since our method is in principle the Newton method. These are great advantages of our 
method. 

Our strategy is as follows. As an initial condition we set Co- For t > (t 6 N + ), we introduce 
an N x N matrix At and denote Ct as Ct = e *C$_i. Next, we evaluate the cost function at C% 
by using the expansion around Ct-\ with respect to the elements of At up to the second order. 
Then At is choosen as a saddle point of this second order expansion. We iteratively follow these 
procedures until we obtain a satisfactory solution. 

This letter is organized as follows. In Section ^ the main part of our algorithm is constructed, 
where the cost function is essentially identical to the sum of kurtoses. We adopt the square of the 
kurtoses for the cost function in Section ||. Explicit expressions for the optimal A (up to the second 
order) are obtained both in Sections || and ||. Section || is a short section where we show how each 
updating step is combined to obtain the optimal C. In Section || the convergence property of our 
algorithm is discussed. Section y contains conclusions and discussions. 



2 



2 Multiplicative update algorithm 



2.1 Expansion of the cost function 

Let us start by defining the cost function: 

f(C,X) = Y,fi(C,X) , 



(2.1) 



where /j's are the fourth order moments of components divided by the square of their variances, 

E((CX)f) 



fi(C,X) = 



(2.2) 



E{{cx)lf ■ 

In this letter we denote by E{A) the expectation of A. Obviously the cost function / coincides with 
the sum of kurtoses of all the components up to the constant. We set D = e A and expand f(D, Y) 
in terms of the elements of A. For example expansions term by term are evaluated as follows: 

E({DY)I) = E(y/) + 4^(A ip + (^) ip )S(y i %) + 6^A ip A^(y i %y 9 ) + 0(A 3 ) 



,A 



E{{DY)1) = E(Y?)+2Y,(^ P + (^ P )E(Y 1 Y p )+Y,^ P ^ q E(Y P Y q ) + 0(A 3 ) . (2.3) 

p p,q 

Hereafter we denote by 0(A k ) polynomials of matrix elements of A which does not contain terms 
with degrees less than k. For brevity's sake we introduce the following notations: 





(2.4) 




(2.5) 


TT (k,i) E ( Y i y p y q) 


(2.6) 


* = K (4) ) 4 /K (2) ) 4 • 


(2.7) 



and 



Using the quantities defined above we can show that the cost function is expanded as 



fi(D,Y) = 



A 



+ 4[(A + — )i? (3) ] ll + 6[AU^A>] + 0(A 3 ) 



1 - 4[(A + ^-)« (1) L - 2[AU^A'] .. + 12[ARW] 2 u + 0(A 3 
K t - 4[(A + - R^)] u + 2[A(3[/^) - mU^A'].. 



+ U Kt [AR^l-W[AR(% t [AR(% t + 0(A^ 



(2.8) 
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by straightforward calculations. Next, we evaluate partial derivatives of the cost function by the 
matrix elements of A. Partially differentiating (|2.8| ), we get an expression, 



df< f^ ,Y) = -A[K - R^] lk -2[(K- R^)A + A(K - R^)] t 



dA k i 1 ilk ^ i v j ii k 

+4 - Kk U^r)A>] lk + 24K lk [AflW] kk - lfifl£> [Ai^] fcfc - 16*g> [Afl«] 



+0(A 2 ) , (2.9) 
where K is an TV" x N matrix defined by 

K pq = K q R$ . (2.10) 

We want to decide A for which the partial derivative by A^; (k ^ I) of the cost function vanish 
on condition that An = for 1 < i < N. We neglect 0(A 3 ) terms in the cost function. Thus the 
right-hand side of (|2.9| ) is regarded as a polynomial of {A k i} of at most first order and it is always 
possible in principle to determine A which satifies the above condition. It is, at the same time, not 
easy to describe the problem in a form which is valid for arbitrary N. In the following subsection 
we will introduce a transparent and unified method for handling the partial derivatives of /. We 
leave this subsection by introducing N x N matrices 

yii) = 3U (2,i) _ KiU (0,i) (2.11) 

and 

Q = K - i? (3) (2.12) 

for later convenience. 



2.2 Expression by tensor product and determination of A 



The expression ( |2.9| ) is quite complicated and not convenient for our purpose, " determine A, where 
all the partial derivatives vanish" . Fortunately by mapping the relations between elements of N x N 
matrices to those of N 2 x TV 2 matrices, we can handle the problem transparently. Some preparations 
are needed. First, let us introduce a map cs: 



A 



( A xx 
V A N1 



A 



12 



Mat(iV,F) 
• A 1N \ 



Ann / 



N 2 



F 



^ cs(A) = (A u A 2 i ■■■ A N i A 12 A 22 



Ann)' 



(2.13) 
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where F is an unspecified field. We also introduce two useful operators T and P. The "intertwiner" 
T is an TV 2 x N 2 matrix defined by 



cs(A') = Tcs(A) for A G Mat(A, F) 



(2.14) 



The projection operator P, 



P = diag(pi,-- - ,p N 2) , 

p k = l for k = N(i - 1) + i, 1 < i < N 
Pk = otherwise , 

is used to extract the "diagonal" elements of a matrix from its image by cs. 
On this setting we can rewrite (|2.9|) as 



(2.15) 



df(e A ,Y) 



OA 



ki 



N 



- 4cs(Q) - 2[I N ®Q + T{I N ® Q')T] cs(A) + 4{ F (i) }cs(A') 

i=i 

+S.24(I N ® K)P(I ® - 16(/jv <& R {1) )P(I ® R {3) Y 
-16(/jv <8> R (3) )P(I ® # {1) )' jcs(A') 



(2.16) 



where Jw is the N x N unit matrix and 



/ y« o 

o y( 2 ) o 



Z+JV(fc— 1) 



\ 



\ 



y(^-i) 
v(*o J 



(2.17) 



We make use of the following fact: 
For X G Mat(JV, F) 

T(I N ® A)T = X ® Jjv . 
See Appendix |A| for the proof of ( |2.18j ). Then ( 2.16 ) becomes 

df(e A ,Y) 



(2.18) 



dA 



ki 



-4[cs(Q)] 



Z+JV(fc— 1) 



[Wcs(A)] 



J+JV(Jfc-l) 



(2.19) 
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where 



N 



W = -2(I N ®Q + Q f ®I N ) +4{0y«}T + 



i=i 



2A{I N ®K)P{I®R { ^)' 



-16(Jjv <g> R W )P(I <g> i? (3) )' - 16(/jv <g> R {3) )P(I <g> 



T . 



(2.20) 



Now let us determine A. Remember that we are going along the spirit of the Newton method. 
Thus we want to find A which satisfies the conditions 



df(e A ,Y) 



dA 



ki 



+ 0(A 2 ) for 1 < k, I < N, k^l 



and 



A 



for 1< k < N . 



(2.21) 



(2.22) 



The conditions ( |2.22| ) make the problem rather complicated one. Fortunately, by using P we can 
combine the conditions ( 2.2 1] ) and ( 2. 22] ) into a matrix equation : 



(I N 2 - P)W(I N 2 -P) + P cs(A) - 4(7^2 - P)cs(Q) = . 
Immediately it follows that 

cs(A) = 4\{I N 2 - P)W(I N 2 -P) + P] ~\l N 2 - P)cs(Q) . 



(2.23) 



(2.24) 



Thus we have obtained A which specify a saddle point of the expansion of f(C, Y) up to the second 
order. Note that quantities in the right-hand side of (|2.24|) are easily estimated ones from the 



observed data. So, an updating is determined by ( 2.24j) without any ambiguities 



3 Case I: square of kurtosis 

Obviously, points where kurtosis vanishes do not play any special role for the cost function / in 
Section [2|. The optimal solution, however, contains components with zero kurtoses when the number 
of the sources is less than that of the observation channels. Thus, in this section we treat with a 
slightly different cost function, which is the sum, 



f(C,X)=J2fi(C,X) , 



(3.1) 
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of the square of the kurtoses, 



fi(C,X) 



E((CX)f) 



(3.2) 



As in the last section, we want to know the saddle point D = e A of the expansion of f i{D, Y) in 
terms of {Ajj} up to the second order. We do not describe details of the calculations in this section, 
which is carried out almost in the same way as in Section ^. First, the expansion of fi{D,Y) is 
evaluated as 

f t (D,Y) = ( Ki -3) 2 -8[(A+^-)(rV Ki -RW)] u ( Ki -3) 

+4[A(3C/( 2 ^ - KiU^A']^ - 3) + lefA^ 1 )^ - R^)] \ 

+24( Kl - 3) Ki [ARW]l - 32( Kl - 3) [AR%. [AR^] U + 0(A 3 ) . (3.3) 

Next, we introduce N x N matrices K, {V^|l < i < N}, S, and Q defined respectively by 



K 



2Rpq( K q 3)K g , 



(3.4) 



V 



2(/e i -3)(3E/' (2><) -KiU^) , 



(3.5) 
(3.6) 



S = diag(2(^ - 3)) 



and 



Qpq — %{ K q ^)(Rpq K q Rpq) • 

We also rewrite Q in (|2.12j ) by q in order to avoid confusions: 

Qpq = (Rpq K q ~ Rpq ) ■ 



(3.7) 



(3.8) 



(3.9) 



Now we proceed to the expression by using the tensor product. We can show that the gradients of 
the cost function have the following expression: 



df(e A ,Y) 



dA 



hi 



-4[cs(Q)]i +JV ( fc -i) + [VTcs(A)] 



l+N(k-l) 



n+0(A 2 ), 



(3.10) 
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where 



N 



W = -2(l N ® Q + Q' ® I N ) + 4{0V (1) }T + 



i=i 



24(I N ® K)P(I ® B.W)' 



+32(I N ® q)P{I N ® g)' - 16(Jiv ® R {l) S)P{I ® i?( 3 ))' 
-16(J A r(8)i? (3) 5)P(J(8)i?W) / T. 



(3.11) 



This is a completely analogous expression to (2.19). Thus the coordinate A of the saddle point of 
the second order expansion is determined by 



cs(A) 



(I N 2-P)W(I N 2-P)+P (I N 2 - P)cs(Q) 



(3.12) 



In many cases obtained through the two cost functions in Section ^ and Section ||| are almost the 
same results. As implied at the beginning of this section, the main difference between these two lies 
in the points where the kurtosis of one of the components vanishes. These point indeed constitue 
saddle points of the cost function /, while it is impossible to capture them by the algorithm 
in Section ^. Thus, we must choose an appropriate method for individual problems having this 
differ nee in mind. 



4 Iteration of updating 

Now we have obtained the updating rules. It is not necessary to tune the learning rate. Apparently, 
( 2.23 ) and ( 3.12j ) look complicated. They are, however, easily implemented by the numerical tools 



like MatLab. (The source codes will be available from our Web-site. ) Starting from Co, Cj 
for positive i is determined by the left multiplication by e Ai , where A is determined by setting 
Y = C i . 1 X,i.e, 

C t = e At e A '- 1 e A *- 2 • • • e Al C . (4.1) 
If A becomes saficiently small, we can stop the iteration and exit the process. 



5 Second order convergence 

First, we will take over the notations in Section 0. The following discussion is, however, valid 
for the algorithm in Section [|| if we substitute the quantities /, W, and so on by their boldface 
counterparts. Let us start this section by introducing some additional notations. We set 

G G GL(N, R) (5.1) 
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and 

K € GL(l, R)® N . (5.2) 
We also define the coset space K\G by introducing the equivalence relation 

g'g- 1 £K^g~g> (5.3) 

to G. That is, K\G = {Kg\g E G}. Our method is understood as an orthodox adaptation of the 

Newton method to this coset space K\G. Note that the cost function F(-) = f /(•, Y) on G satisfies 
the relation 

F(g) = F[Kg) . (5.4) 

So F is naturally considered as a function on K\G. That is the reason of our choice for the cost 
function. Thus, the second-order convergence immediately follows if the the correction to the error 
with respect to the coordinating resulting from the multiplicative nature is properly evaluated. 
At time t, a point g on K\G is specified by the coordinate X^\g) € m such that 



C t ~g, (5.5) 



where m is the set of N x N matrices whose diagonal elements are zeros. Actually, this state- 
ment itself is not a thing of course, for which the proof will be given elsewhere. Define Ft, the 
representation of the cost function at t, by 

F t (X) = F(e x C t ) . (5.6) 

Here we introduce an (N 2 — N) x iV 2 matrix P by drawing out the i + N(i — l)-th raws from the 
unit A^ 2 x iV 2 matrix where i = N, N — 1, • • ■ ,2,1. We will denote by the Hessian, 

„<;> = , ggw ( 5.7) 

d(Pcs(X)) t d(Pcs(X)), 



Note that if we set 

h,(X)-T[ /. P)W(Iy, - P)+P 

the Hessian is written as 



(5.8) 

C=e x C t 



ffW = Ph t P' . (5.9) 

Suppose that at some neighborhood of the optimal solution g*, H®(X) is Lipschitz continuous for 
some t: 

\\hM(X) - H {t \X')\\ <L\\X- X'\\ , (5.10) 
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where is the norm of a matrix A as the Euclidian space, 

1 1 A|| 2 =tr(AAt) . 

We set 

P=\\H( t Xx\g*))- 1 \\ . 
There exists a positive real number r, for which 

| fo))- 1 1 1 < 2/? for € B^(g^r) ^ | 5 



r > \\X\g) - X\g*)\ 



is satisfied. Then it is known that for all g € #(<?*, min(r, (2 PL) 1 )), 

\\X\C t+l ) - X\g*)\\ < 0L\\X\C t ) - X\g*)\\ 2 

and 

\\X\C t + l )-X\g*)\\<\\\X t (C t )-X\g»)\\ 



(5.11) 
(5.12) 

(5.13) 

(5.14) 
(5.15) 



are fulfilled. Thus the second order convergence in this norm is shown. Unfortunately, this norm 
is not invariant and is unnatural. (A natural metric on K\G is one which is invariant under the 
parallel transformation, which is induced by the action of elements in K\G from the right-hand 
side.) But, it suffices in practice. 



6 Discussions 
6.1 Nonholonomy? 

Our method is related to the nonholonomic method by Amari, Chen, and Chichocki| Amari et al, 1997|1 



In essence our method is a Newton approach to the same problem, the optimization without 
prewhitening. Let us set 

e z = e x e y (6.1) 

for x,y € gl(N,R). Then it is obvious that z does not necessarily belongs to m even if x, y & m(, 
that is, ZiiS do not always vanish when xa = ya = for 1 < i < N). This may be explained 
by using the concept of nonholonomy. The degree of freedom in each step, however, equals the 
dimension of the space K\G in our setting. The nonholonomic nature emerges when we go back to 
G = GL(N, R) again. 

There exist several studies} M.Takeuchi, 1994 , 5.Helgason,1978| , [S.Helgason,1962j , |S.Hclgason,1984 



T.Akuzawa fc M.Wadati,1998 | which deal with cosets like K\G or the right coset G/K when K is 
a maximal compact subgroup of G. Unfortunately, what we are studying is the case where K is not 
a maximal compact subgroup of G. So, for example it is necessary to show whether the coordinate 
Q5.5| ) is justified or not. As mentioned above, further studies including this justification will appear 
elsewhere. 
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6.2 Global convergence 



We should carefully treat first few steps since this method also has a somewhat undesirable 
global convergent property inherent in the Newton method. Fortunately enough, there ex- 
ist methods which can handle the earlier stage. For example, the nonholonomic gradient 



method | Amari et al, 1997] may be applicable. Another posiibility is to construct a nonholonomic 
fixed-point algorithm which uses the kernel method. These methods are suitable for capturing the 
optimal point which contains components with zero kurtoses. There we must, of course, use the 
method in Section ||. If it is not necessary to worry about these zero kurtosis components, there is 
little difference between the two methods described in Section ^ and Section |||. 

6.3 Conclusions 

We have constructed a new algorithm for finding a optimal point in a matrix space, where we 
have introduced a new multiplicative updating method. The algorithm is in essence the Newton 
method on a coset. So it converges quite rapidly and it can capture the saddle point. Since it 
does not require prewhitening, it is not necessary to worry about the error resulting from the 
prewhitening. Indeed, it is possible to increase the kurtosis slightly for data preprocessed by the 
FastlCAfHiirri et a/.,1998||. 
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appendix 

A proof of ( ggg ) 

For B G GL(N, F) and 1 < i,j < N, 

[T(X ® F)Tcs(S)]i +JV(i -i) = [(X ® F)Tcs(S)] i+iV(l -_ 1) 

= X^^B')^ = (y^'X')^ . (A.l) 

On the other hand 

[(y ® x) cs (s)] i+iV(i _ 1) = r iP x^ p = (yfl'x')* . (a.2) 

This proves the statement since cs is bijective. □ 
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